June 19, 2018

Aims of workshop:

  • Learn the main dplyr verbs
  • Understand how pipes work

How the practical works?

  • Practical is on RStudio cloud : https://rstudio.cloud/project/42122

  • It needs an account, but these are free!

  • There is an Rscript and a Rmarkdown file
    • Work with whatever one you are most comfortable with

The data we are going to be working with

  1. gapminder
    • "An excerpt of the data available at Gapminder.org. For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007."
    • For more information : https://www.gapminder.org/
  2. msleep
    • Sleep times and weights for mammals
    • For more information: V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007.

Why dplyr?

  • It's fast
  • Simple clear syntax
  • Chain commands together
  • Connects with external databases

Tidy Data

  • What is "tidy" data?
    • Observations as rows
    • Variables as columns
    • One type of observational unit per table
    • Each value belongs to a variable and an observation
    • Column headers are variable names not variables

Understanding the structure of your data

  • Various functions exist to look at the data and give you basic information about it.
    • head(df) \(\rightarrow\) prints to screen first few rows of data
    • tail(df) \(\rightarrow\) prints to screen last few rows of data
    • class(df) \(\rightarrow\) class of data
    • dim(df) \(\rightarrow\) how many rows and columns
    • colnames(df) \(\rightarrow\) column names
    • str(df) \(\rightarrow\) data preview, and column data types
    • summary(df) \(\rightarrow\) summary of data
  • df == the name of your dataframe

Let's look at gapminder

head(gapminder, n = 2)
## # A tibble: 2 x 6
##   country     continent  year lifeExp     pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8 8425333      779.
## 2 Afghanistan Asia       1957    30.3 9240934      821.
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
dim(gapminder)
## [1] 1704    6

Let's a look at the practical

dplyr

dplyr 'verbs'

  • 6 main "verbs" in dplyr
    • filter() \(\rightarrow\) return rows with matching conditions
    • select() \(\rightarrow\) select variables by name
    • mutate() \(\rightarrow\) create new variables
    • group_by() \(\rightarrow\) return grouping variables
    • summarise() \(\rightarrow\) reduces multiple values down to a single value
    • arrange() \(\rightarrow\) arrange rows by variables

filter() \(\rightarrow\) filter rows

  • Use filter to keep or remove certain rows from a data set.
  • Use conditionals to keep only the rows that match the criteria.
# single filter 
filter(gapminder, year == 1997)

filter() \(\rightarrow\) Strings

Strings are characters/words/letters

Strings need to be in quotes

filter(gapminder, 
       country == "United Kingdom")
## # A tibble: 12 x 6
##    country        continent  year lifeExp      pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>    <int>     <dbl>
##  1 United Kingdom Europe     1952    69.2 50430000     9980.
##  2 United Kingdom Europe     1957    70.4 51430000    11283.
##  3 United Kingdom Europe     1962    70.8 53292000    12477.
##  4 United Kingdom Europe     1967    71.4 54959000    14143.
##  5 United Kingdom Europe     1972    72.0 56079000    15895.
##  6 United Kingdom Europe     1977    72.8 56179000    17429.
##  7 United Kingdom Europe     1982    74.0 56339704    18232.
##  8 United Kingdom Europe     1987    75.0 56981620    21665.
##  9 United Kingdom Europe     1992    76.4 57866349    22705.
## 10 United Kingdom Europe     1997    77.2 58808266    26075.
## 11 United Kingdom Europe     2002    78.5 59912431    29479.
## 12 United Kingdom Europe     2007    79.4 60776238    33203.

filter() \(\rightarrow\) Relational operations

Relational Operations Outputs TRUE or FALSE based on the validity of the statement.
a == b a is equal to b
a != b a is not equal to b
a > b a is greater than b
a < b a is less than b
a >= b a is greater than or equal to b
a <= b a is less than or equal to b
a %in% b a is an element in b

Table from http://stat545.com/cm005-notes_and_exercises.html

filter() \(\rightarrow\) %in%

%in% - value match - Does this exist in that

Filter year to only keep those that are either 1977 or 1997

filter(gapminder, 
       year %in% c(1977, 1997))
## # A tibble: 284 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1977    38.4 14880372      786.
##  2 Afghanistan Asia       1997    41.8 22227415      635.
##  3 Albania     Europe     1977    68.9  2509048     3533.
##  4 Albania     Europe     1997    73.0  3428038     3193.
##  5 Algeria     Africa     1977    58.0 17152804     4910.
##  6 Algeria     Africa     1997    69.2 29072015     4797.
##  7 Angola      Africa     1977    39.5  6162675     3009.
##  8 Angola      Africa     1997    41.0  9875024     2277.
##  9 Argentina   Americas   1977    68.5 26983828    10079.
## 10 Argentina   Americas   1997    73.3 36203463    10967.
## # ... with 274 more rows

filter() \(\rightarrow\) Multiple filters

Multiple filters may be separated by a , (comma)

dplyr interprets the , (comma) to be & (and)

filter(gapminder, 
          year == 1997, lifeExp > 78) 
## # A tibble: 13 x 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Australia        Oceania    1997    78.8  18565243    26998.
##  2 Canada           Americas   1997    78.6  30305843    28955.
##  3 France           Europe     1997    78.6  58623428    25890.
##  4 Hong Kong, China Asia       1997    80     6495918    28378.
##  5 Iceland          Europe     1997    79.0    271192    28061.
##  6 Israel           Asia       1997    78.3   5531387    20897.
##  7 Italy            Europe     1997    78.8  57479469    24675.
##  8 Japan            Asia       1997    80.7 125956499    28817.
##  9 Netherlands      Europe     1997    78.0  15604464    30246.
## 10 Norway           Europe     1997    78.3   4405672    41283.
## 11 Spain            Europe     1997    78.8  39855442    20445.
## 12 Sweden           Europe     1997    79.4   8897619    25267.
## 13 Switzerland      Europe     1997    79.4   7193761    32135.

filter() \(\rightarrow\) Relational operations

Logical Operations Outputs TRUE or FALSE based on the validity of the statement.
a & b, a && b Both a and b are TRUE
a b, a
!a a is not TRUE (take everything else but a)
xor(a, b) Either a or b is TRUE, but not both.
all(a,b,c,…) a, b, c, . are all TRUE.
any(a,b,c,…) Any one of a, b, c, . is TRUE.

Table from http://stat545.com/cm005-notes_and_exercises.html

filter() \(\rightarrow\) OR

Keep a row if either criteria is met

filter(gapminder, 
       pop > 350000000 | lifeExp > 80)
## # A tibble: 45 x 6
##    country   continent  year lifeExp        pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Australia Oceania    2002    80.4   19546792    30688.
##  2 Australia Oceania    2007    81.2   20434176    34435.
##  3 Canada    Americas   2007    80.7   33390141    36319.
##  4 China     Asia       1952    44    556263527      400.
##  5 China     Asia       1957    50.5  637408000      576.
##  6 China     Asia       1962    44.5  665770000      488.
##  7 China     Asia       1967    58.4  754550000      613.
##  8 China     Asia       1972    63.1  862030000      677.
##  9 China     Asia       1977    64.0  943455000      741.
## 10 China     Asia       1982    65.5 1000281000      962.
## # ... with 35 more rows

filter() \(\rightarrow\) between()

between() can be used to find observations between a range.

Instead of writing this:

filter(gapminder, 
       year >= 1977 | year <= 1997)

We can do this

filter(gapminder, 
       between(year, 1977, 1997) )
## # A tibble: 710 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1977    38.4 14880372      786.
##  2 Afghanistan Asia       1982    39.9 12881816      978.
##  3 Afghanistan Asia       1987    40.8 13867957      852.
##  4 Afghanistan Asia       1992    41.7 16317921      649.
##  5 Afghanistan Asia       1997    41.8 22227415      635.
##  6 Albania     Europe     1977    68.9  2509048     3533.
##  7 Albania     Europe     1982    70.4  2780097     3631.
##  8 Albania     Europe     1987    72    3075321     3739.
##  9 Albania     Europe     1992    71.6  3326498     2497.
## 10 Albania     Europe     1997    73.0  3428038     3193.
## # ... with 700 more rows

filter() \(\rightarrow\) Be careful on order!

R will run the & before the |

Keep only rows where country is Canada or France and the year is 1982

filter(gapminder, 
       country == 'Canada' | country == 'France' & year == 1982 )
## # A tibble: 13 x 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## 11 Canada  Americas   2002    79.8 31902268    33329.
## 12 Canada  Americas   2007    80.7 33390141    36319.
## 13 France  Europe     1982    74.9 54433565    20294.

filter() \(\rightarrow\) Be careful on order!

R will run the & before the |

Keep only rows where country is Canada or France and the year is 1982

filter(gapminder, 
       (country == 'Canada' | country == 'France') & year == 1982)
## # A tibble: 2 x 6
##   country continent  year lifeExp      pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Canada  Americas   1982    75.8 25201900    22899.
## 2 France  Europe     1982    74.9 54433565    20294.

filter() \(\rightarrow\) Remove Missing Data

Use !is.na() to filter out missing data

head(msleep, n = 2)
## # A tibble: 2 x 11
##   name   genus  vore  order conservation sleep_total sleep_rem sleep_cycle
##   <chr>  <chr>  <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
## 1 Cheet~ Acino~ carni Carn~ lc                  12.1      NA            NA
## 2 Owl m~ Aotus  omni  Prim~ <NA>                17         1.8          NA
## # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
filter(msleep, 
       !is.na(conservation))
## # A tibble: 54 x 11
##    name   genus vore  order conservation sleep_total sleep_rem sleep_cycle
##    <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
##  1 Cheet~ Acin~ carni Carn~ lc                  12.1      NA        NA    
##  2 Mount~ Aplo~ herbi Rode~ nt                  14.4       2.4      NA    
##  3 Great~ Blar~ omni  Sori~ lc                  14.9       2.3       0.133
##  4 Cow    Bos   herbi Arti~ domesticated         4         0.7       0.667
##  5 North~ Call~ carni Carn~ vu                   8.7       1.4       0.383
##  6 Dog    Canis carni Carn~ domesticated        10.1       2.9       0.333
##  7 Roe d~ Capr~ herbi Arti~ lc                   3        NA        NA    
##  8 Goat   Capri herbi Arti~ lc                   5.3       0.6      NA    
##  9 Guine~ Cavis herbi Rode~ domesticated         9.4       0.8       0.217
## 10 Grivet Cerc~ omni  Prim~ lc                  10         0.7      NA    
## # ... with 44 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
## #   bodywt <dbl>

Let's try filter()

# single filter 
filter(gapminder, year == 1997)

filter(gapminder, continent == "Asia")

 # multiple filters 
filter(gapminder, year == 1997 & gdpPercap > 1000) 

filter(gapminder, year %in% c(1977, 1997)) 
  • Column names must match EXACTLY to the dataframe
  • Observations that are strings (characters/words) need quotes
  • R is case sensitive!

select() \(\rightarrow\) select columns

Use select to keep or remove certain columns from a data sets.

select(gapminder, country, year, lifeExp)

select() \(\rightarrow\) select columns vectorized

Select all columns from country to life expectancy (lifeExp)

select(gapminder, country:lifeExp)  
## # A tibble: 1,704 x 4
##    country     continent  year lifeExp
##    <fct>       <fct>     <int>   <dbl>
##  1 Afghanistan Asia       1952    28.8
##  2 Afghanistan Asia       1957    30.3
##  3 Afghanistan Asia       1962    32.0
##  4 Afghanistan Asia       1967    34.0
##  5 Afghanistan Asia       1972    36.1
##  6 Afghanistan Asia       1977    38.4
##  7 Afghanistan Asia       1982    39.9
##  8 Afghanistan Asia       1987    40.8
##  9 Afghanistan Asia       1992    41.7
## 10 Afghanistan Asia       1997    41.8
## # ... with 1,694 more rows

select() \(\rightarrow\) not select a column

Select all columns BUT country

Using - to not select a specific column

Can also be vectorized or used with helper functions

select(gapminder, -country) 
## # A tibble: 1,704 x 5
##    continent  year lifeExp      pop gdpPercap
##    <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Asia       1952    28.8  8425333      779.
##  2 Asia       1957    30.3  9240934      821.
##  3 Asia       1962    32.0 10267083      853.
##  4 Asia       1967    34.0 11537966      836.
##  5 Asia       1972    36.1 13079460      740.
##  6 Asia       1977    38.4 14880372      786.
##  7 Asia       1982    39.9 12881816      978.
##  8 Asia       1987    40.8 13867957      852.
##  9 Asia       1992    41.7 16317921      649.
## 10 Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

select() \(\rightarrow\) Helper Functions

  • Make it easier to select multiple columns
    • starts_with("Hap") \(\rightarrow\) ALL column names that start with Hap
    • ends_with("ppy") \(\rightarrow\) ALL column names that end with ppy
    • contains("app") \(\rightarrow\) ALL column names that contain app
    • matches("^(SNP|CHR)_") \(\rightarrow\) ALL column names that match a regular expression
    • num_range("x", 1980:1983) \(\rightarrow\) ALL column names x1980, x1981, x1982, and x1983
    • one_of(character_vector) \(\rightarrow\) ALL column names that appear in character_vector

select() \(\rightarrow\) Helper Functions - Example

Select all columns that start with sleep

select(msleep, starts_with("sleep"))
## # A tibble: 83 x 3
##    sleep_total sleep_rem sleep_cycle
##          <dbl>     <dbl>       <dbl>
##  1        12.1      NA        NA    
##  2        17         1.8      NA    
##  3        14.4       2.4      NA    
##  4        14.9       2.3       0.133
##  5         4         0.7       0.667
##  6        14.4       2.2       0.767
##  7         8.7       1.4       0.383
##  8         7        NA        NA    
##  9        10.1       2.9       0.333
## 10         3        NA        NA    
## # ... with 73 more rows

select() \(\rightarrow\) Helper Functions - one_of()

Allows you to create a character vector of column names and use it to select columns from dataframe

to_keep <- c('country', 'lifeExp', 'gdpPercap')

select(gapminder, one_of(to_keep))
## # A tibble: 1,704 x 3
##    country     lifeExp gdpPercap
##    <fct>         <dbl>     <dbl>
##  1 Afghanistan    28.8      779.
##  2 Afghanistan    30.3      821.
##  3 Afghanistan    32.0      853.
##  4 Afghanistan    34.0      836.
##  5 Afghanistan    36.1      740.
##  6 Afghanistan    38.4      786.
##  7 Afghanistan    39.9      978.
##  8 Afghanistan    40.8      852.
##  9 Afghanistan    41.7      649.
## 10 Afghanistan    41.8      635.
## # ... with 1,694 more rows

select() \(\rightarrow\) Change column names

  • 2 different methods with different outcome:
    • using select() \(\rightarrow\) only keeps columns listed in select statement
    • using rename() \(\rightarrow\) keep all columns
select(gapminder, Life_Expectancy = lifeExp)
## # A tibble: 1,704 x 1
##    Life_Expectancy
##              <dbl>
##  1            28.8
##  2            30.3
##  3            32.0
##  4            34.0
##  5            36.1
##  6            38.4
##  7            39.9
##  8            40.8
##  9            41.7
## 10            41.8
## # ... with 1,694 more rows

select() \(\rightarrow\) Change column names

  • 2 different methods with different outcome:
    • using select() \(\rightarrow\) only keeps columns listed in select statement
    • using rename() \(\rightarrow\) keep all columns
rename(gapminder, Life_Expectancy = lifeExp)
## # A tibble: 1,704 x 6
##    country     continent  year Life_Expectancy      pop gdpPercap
##    <fct>       <fct>     <int>           <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952            28.8  8425333      779.
##  2 Afghanistan Asia       1957            30.3  9240934      821.
##  3 Afghanistan Asia       1962            32.0 10267083      853.
##  4 Afghanistan Asia       1967            34.0 11537966      836.
##  5 Afghanistan Asia       1972            36.1 13079460      740.
##  6 Afghanistan Asia       1977            38.4 14880372      786.
##  7 Afghanistan Asia       1982            39.9 12881816      978.
##  8 Afghanistan Asia       1987            40.8 13867957      852.
##  9 Afghanistan Asia       1992            41.7 16317921      649.
## 10 Afghanistan Asia       1997            41.8 22227415      635.
## # ... with 1,694 more rows

Let's try select()

select(gapminder, country, year, lifeExp)

# vectorized 
select(gapminder, country:year)  
# select all columns from country to year 

# Not select a column 
select(gapminder, -country) 
 # select all columns BUT country 

select(gapminder, contains("co"))
  • Column names must match EXACTLY to the dataframe
  • Observations that are strings (characters/words) need quotes
  • R is case sensitive!

Introducing the pipe %>%

  • Commands in dplyr can be piped together
    • Saves on writing coding and intermediate steps
# multiple steps with intermediate dataframes created
new_df <-  filter(gapminder, year == 1997) 
select(new_df, -year) 

# nested commands
select(
  filter(gapminder, 
         year == 1997), 
  -year)

# using a pipe %>%
gapminder %>%
  filter(year == 1997) %>%
  select(-year)   

The pipe %>% takes the output from the previous command as the input the next command

pipe %>%

Input data information for each command is now missing

The input is the output of the function above!

subsequent "piped" commands are indented - makes code easier to read

gapminder %>%
  filter(year == 1997) %>%
  select(-year)   
## # A tibble: 142 x 5
##    country     continent lifeExp       pop gdpPercap
##    <fct>       <fct>       <dbl>     <int>     <dbl>
##  1 Afghanistan Asia         41.8  22227415      635.
##  2 Albania     Europe       73.0   3428038     3193.
##  3 Algeria     Africa       69.2  29072015     4797.
##  4 Angola      Africa       41.0   9875024     2277.
##  5 Argentina   Americas     73.3  36203463    10967.
##  6 Australia   Oceania      78.8  18565243    26998.
##  7 Austria     Europe       77.5   8069876    29096.
##  8 Bahrain     Asia         73.9    598561    20292.
##  9 Bangladesh  Asia         59.4 123315288      973.
## 10 Belgium     Europe       77.5  10199787    27561.
## # ... with 132 more rows

pipe %>%

  • When to not use a pipe:
    • Multiple inputs or outputs
      • Pipes should focus on one clear task for one object
    • More than 10 steps
    • If you need a circular step
      • Pipes are linear, they don't loop back round!

Let's try the pipe %>%

gapminder %>%
  filter(year == 1997) %>%
  select(-year)   
  • With %>%, there is no more input data information for each command
  • Indentation makes it easier to follow the trail of commands within the pipe
  • Column names must match EXACTLY to the dataframe
  • Observations that are strings (characters/words) need quotes
  • R is case sensitive!

mutate() \(\rightarrow\) Mutate existing columns to make new columns

gdp_billion is the name of the newly created variable

= gdpPercap*pop/10^9 is the equation to make the new variable

mutate(gapminder, gdp_billion = gdpPercap*pop/10^9) 

mutate() \(\rightarrow\) if you don't set a name

There is no column name set for the new variable - defaults to the equation

= gdpPercap*pop/10^9 is the equation to make the new variable

mutate(gapminder, gdpPercap*pop/10^9) 
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap `gdpPercap * po~
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>            <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.             6.57
##  2 Afghanistan Asia       1957    30.3  9240934      821.             7.59
##  3 Afghanistan Asia       1962    32.0 10267083      853.             8.76
##  4 Afghanistan Asia       1967    34.0 11537966      836.             9.65
##  5 Afghanistan Asia       1972    36.1 13079460      740.             9.68
##  6 Afghanistan Asia       1977    38.4 14880372      786.            11.7 
##  7 Afghanistan Asia       1982    39.9 12881816      978.            12.6 
##  8 Afghanistan Asia       1987    40.8 13867957      852.            11.8 
##  9 Afghanistan Asia       1992    41.7 16317921      649.            10.6 
## 10 Afghanistan Asia       1997    41.8 22227415      635.            14.1 
## # ... with 1,694 more rows

Make sure you set column names when using mutate()

mutate() \(\rightarrow\) Multiple columns

Each new column is separated by a , (comma)

mutate(gapminder, 
       gdp_billion = gdpPercap*pop/10^9, 
       life_years = lifeExp * pop, 
       life_gdp = life_years / gdp_billion) 
## # A tibble: 1,704 x 9
##    country continent  year lifeExp    pop gdpPercap gdp_billion life_years
##    <fct>   <fct>     <int>   <dbl>  <int>     <dbl>       <dbl>      <dbl>
##  1 Afghan~ Asia       1952    28.8 8.43e6      779.        6.57 242658016.
##  2 Afghan~ Asia       1957    30.3 9.24e6      821.        7.59 280296010.
##  3 Afghan~ Asia       1962    32.0 1.03e7      853.        8.76 328515855.
##  4 Afghan~ Asia       1967    34.0 1.15e7      836.        9.65 392521603.
##  5 Afghan~ Asia       1972    36.1 1.31e7      740.        9.68 472011552.
##  6 Afghan~ Asia       1977    38.4 1.49e7      786.       11.7  571971739.
##  7 Afghan~ Asia       1982    39.9 1.29e7      978.       12.6  513391895.
##  8 Afghan~ Asia       1987    40.8 1.39e7      852.       11.8  566117741.
##  9 Afghan~ Asia       1992    41.7 1.63e7      649.       10.6  680033040.
## 10 Afghan~ Asia       1997    41.8 2.22e7      635.       14.1  928283533.
## # ... with 1,694 more rows, and 1 more variable: life_gdp <dbl>

Can use newly created variables IMMEDIATELY!!

Let's try mutate()

# single variables
mutate(gapminder, gdp_billion = gdpPercap*pop/10^9) 

# multiple variables 
mutate(gapminder, 
       gdp_billion = gdpPercap*pop/10^9, 
       life_years = lifeExp * pop, 
       life_gdp = life_years / gdp_billion) 
  • Name your new columns for mutate()!
  • With %>%, there is no more input data information for each command
  • Indentation makes it easier to follow the trail of commands within the pipe
  • Column names must match EXACTLY to the dataframe
  • Observations that are strings (characters/words) need quotes
  • R is case sensitive!

Practical will build on previous verbs and the pipe!

arrange() \(\rightarrow\) sort rows

Single parameter ascending arrange

# ascending sort
arrange(gapminder, lifeExp)

# descending sort
arrange(gapminder, desc(lifeExp)) 

arrange() \(\rightarrow\) Multiple parameters

  • In a multiple parameter arrange, the order is important!
arrange(gapminder, desc(lifeExp), pop) 
## # A tibble: 1,704 x 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  3 Japan            Asia       2002    82   127065841    28605.
##  4 Iceland          Europe     2007    81.8    301931    36181.
##  5 Switzerland      Europe     2007    81.7   7554661    37506.
##  6 Hong Kong, China Asia       2002    81.5   6762476    30209.
##  7 Australia        Oceania    2007    81.2  20434176    34435.
##  8 Spain            Europe     2007    80.9  40448191    28821.
##  9 Sweden           Europe     2007    80.9   9031088    33860.
## 10 Israel           Asia       2007    80.7   6426679    25523.
## # ... with 1,694 more rows
  • In the example above, the gapminder data is:
    • First sorted by descending life expectancy (desc(lifeExp))
    • Any ties broken by ascending population (pop)

Let's try arrange()

# single parameter ascending arrange 
arrange(gapminder, lifeExp)

 # single parameter descending arrange 
arrange(gapminder, desc(lifeExp)) 

# multiple parameter arrange 
arrange(gapminder, desc(lifeExp), pop) 
  • Name your new columns for mutate()!
  • With %>%, there is no more input data information for each command
  • Indentation makes it easier to follow the trail of commands within the pipe
  • Column names must match EXACTLY to the dataframe
  • Observations that are strings (characters/words) need quotes
  • R is case sensitive!

Practical will build on previous verbs and the pipe!

group_by() \(\rightarrow\) Group by factor/category

Group together rows of data by some factor within the dataset.

Alone, it is hard to see what this command is doing

BUT it is INCREDIBLY useful and VERY powerful

group_by(gapminder, continent)
## # A tibble: 1,704 x 6
## # Groups:   continent [5]
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

group_by() \(\rightarrow\) Group by factor/category

Once a dataset is grouped, dplyr performs all operations on that group

Its like each group has become its own dataset

Image taken from Software Carpentry Foundation (https://software-carpentry.org/)

group_by() \(\rightarrow\) Multiple variables

Each column that you want to group by is listed separated by a , (comma)

Order is important!

Below will first group by country and then continent

group_by(gapminder, country, continent)
## # A tibble: 1,704 x 6
## # Groups:   country, continent [142]
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

summarise() \(\rightarrow\) Create summary statistics

summarize() also works

# get mean of life expectancy 
summarise(gapminder, mean_lifexp = mean(lifeExp))
## # A tibble: 1 x 1
##   mean_lifexp
##         <dbl>
## 1        59.5

summarise() returns a dataframe

  • Need to name new columns
    • mean_lifexp is the column name
    • mean(lifeExp) is what populates the column
    • Exact same idea as with mutate()

summarise() \(\rightarrow\) Multiple outputs

summarize() also works

Each summary statistic is separated by a , (comma)

# get mean of life expectancy and mean of GDP 
summarise(gapminder, 
          mean_lifexp = mean(lifeExp), 
          mean_gdp = mean(gdpPercap))
## # A tibble: 1 x 2
##   mean_lifexp mean_gdp
##         <dbl>    <dbl>
## 1        59.5    7215.

summarise() \(\rightarrow\) options

These are not the only functions that work with summarise()

Function Description
min() minimum
max() maximum
sum() sum
range() range
mean() mean
median() median
sd() standard deviation
var() variance

See http://www.statmethods.net/management/functions.html for more information/options

summarise() \(\rightarrow\) n() and n_distinct()

n() counts number of observations

n_distinct() counts the number of distinct observation for that column

gapminder %>% 
  summarise(n = n(), 
            n_distinct = n_distinct(country)) 
## # A tibble: 1 x 2
##       n n_distinct
##   <int>      <int>
## 1  1704        142

group_by() & summarise()

Using group_by() with summarise() returns the summary statistic for each member of the group

gapminder %>% 
    group_by(continent) %>% 
    summarise(mean_lifeExp = mean(lifeExp), 
                median_gdp = median(gdpPercap)) 
## # A tibble: 5 x 3
##   continent mean_lifeExp median_gdp
##   <fct>            <dbl>      <dbl>
## 1 Africa            48.9      1192.
## 2 Americas          64.7      5466.
## 3 Asia              60.1      2647.
## 4 Europe            71.9     12082.
## 5 Oceania           74.3     17983.

group_by() & summarise()

  • Because the dataset is grouped, dplyr performs summarise() on each group separately
    • like each group has become its own dataset

Image taken from Software Carpentry Foundation (https://software-carpentry.org/)

group_by() & summarise() \(\rightarrow\) n() and n_distinct()

n() counts number of observations

n_distinct() counts the number of distinct observation for that column

gapminder %>% 
  group_by(continent) %>%
  summarise(n = n(), 
            n_distinct = n_distinct(country)) 
## # A tibble: 5 x 3
##   continent     n n_distinct
##   <fct>     <int>      <int>
## 1 Africa      624         52
## 2 Americas    300         25
## 3 Asia        396         33
## 4 Europe      360         30
## 5 Oceania      24          2

group_by() & add_tally()

get number of observations for each group - while keeping the entire original dataframe

Use to subset for duplicates, minimum/maximum occurrences, etc.

gapminder %>%
  group_by(country) %>%
  add_tally()
## # A tibble: 1,704 x 7
## # Groups:   country [142]
##    country     continent  year lifeExp      pop gdpPercap     n
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <int>
##  1 Afghanistan Asia       1952    28.8  8425333      779.    12
##  2 Afghanistan Asia       1957    30.3  9240934      821.    12
##  3 Afghanistan Asia       1962    32.0 10267083      853.    12
##  4 Afghanistan Asia       1967    34.0 11537966      836.    12
##  5 Afghanistan Asia       1972    36.1 13079460      740.    12
##  6 Afghanistan Asia       1977    38.4 14880372      786.    12
##  7 Afghanistan Asia       1982    39.9 12881816      978.    12
##  8 Afghanistan Asia       1987    40.8 13867957      852.    12
##  9 Afghanistan Asia       1992    41.7 16317921      649.    12
## 10 Afghanistan Asia       1997    41.8 22227415      635.    12
## # ... with 1,694 more rows

group_by() & filter()

group_by() comes in handy for more than just summarise()

Combine group_by() and filter() to filter the data by groups

gapminder %>% 
  group_by(continent) %>%
  filter(lifeExp == max(lifeExp) | pop == min(pop)) 
## # A tibble: 10 x 6
## # Groups:   continent [5]
##    country               continent  year lifeExp       pop gdpPercap
##    <fct>                 <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Australia             Oceania    2007    81.2  20434176    34435.
##  2 Bahrain               Asia       1952    50.9    120447     9867.
##  3 Canada                Americas   2007    80.7  33390141    36319.
##  4 Iceland               Europe     1952    72.5    147962     7268.
##  5 Iceland               Europe     2007    81.8    301931    36181.
##  6 Japan                 Asia       2007    82.6 127467972    31656.
##  7 New Zealand           Oceania    1952    69.4   1994794    10557.
##  8 Reunion               Africa     2007    76.4    798094     7670.
##  9 Sao Tome and Principe Africa     1952    46.5     60011      880.
## 10 Trinidad and Tobago   Americas   1952    59.1    662850     3023.

group_by() & filter() with row_number()

Say I want the first and last occurrence of each group sprted by some element

row_number() returns the row that you specify
n() refers to the total number of observations - can be different for each group!

gapminder %>% 
  group_by(country) %>%
  arrange(year) %>%
  filter(row_number() == 1 | row_number() == n() )  %>%
  arrange(country)
## # A tibble: 284 x 6
## # Groups:   country [142]
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       2007    43.8 31889923      975.
##  3 Albania     Europe     1952    55.2  1282697     1601.
##  4 Albania     Europe     2007    76.4  3600523     5937.
##  5 Algeria     Africa     1952    43.1  9279525     2449.
##  6 Algeria     Africa     2007    72.3 33333216     6223.
##  7 Angola      Africa     1952    30.0  4232095     3521.
##  8 Angola      Africa     2007    42.7 12420476     4797.
##  9 Argentina   Americas   1952    62.5 17876956     5911.
## 10 Argentina   Americas   2007    75.3 40301927    12779.
## # ... with 274 more rows

ungroup()

Grouped dataframes remained grouped

Explicitly need to ungroup() the data

gap <- gapminder %>%
  group_by(continent) %>%
  filter(lifeExp == max(lifeExp) | pop == min(pop))

summarise(gap, mean_gdpPercap = mean(gdpPercap))
## # A tibble: 5 x 2
##   continent mean_gdpPercap
##   <fct>              <dbl>
## 1 Africa             4275.
## 2 Americas          19671.
## 3 Asia              20762.
## 4 Europe            21724.
## 5 Oceania           22496.

ungroup()

Grouped dataframes remained grouped

Explicitly need to ungroup() the data

gap <- gapminder %>%
  group_by(continent) %>%
  filter(lifeExp == max(lifeExp) | pop == min(pop)) %>%
  ungroup()

summarise(gap, mean_gdpPercap = mean(gdpPercap))
## # A tibble: 1 x 1
##   mean_gdpPercap
##            <dbl>
## 1         17786.

Let's try group_by() & summarise()

gapminder %>% 
    group_by(continent) %>% 
    summarise(mean_lifeExp = mean(lifeExp),
              median_gdp = median(gdpPercap)) 
  • Name your new columns for mutate() & summarise()!
  • With %>%, there is no more input data information for each command
  • Indentation makes it easier to follow the trail of commands within the pipe
  • Column names must match EXACTLY to the dataframe
  • Observations that are strings (characters/words) need quotes
  • R is case sensitive!

Practical will build on previous verbs and the pipe!

Summary

Syntax for dplyr is more user friendly than base R

Start asking complex questions within a few lines of code

Citations

THANK YOU